How Do LLMs Fail In Agentic Scenarios? A Qualitative Analysis of Success and Failure Scenarios of Various LLMs in Agentic Simulations
We investigate how large language models (LLMs) fail when operating as autonomous agents with tool-use capabilities. Using the Kamiwaza Agentic Merit Index (KAMI) v0.1 benchmark, we analyze 900 execution traces from three representative models - Granite 4 Small, Llama 4 Maverick, and DeepSeek V3.1 - across filesystem, text extraction, CSV analysis, and SQL scenarios. Rather than focusing on aggregate scores, we perform fine-grained, per-trial behavioral analysis to surface the strategies that enable successful multi-step tool execution and the recurrent failure modes that undermine reliability. Our findings show that model scale alone does not predict agentic robustness: Llama 4 Maverick (400B) performs only marginally better than Granite 4 Small (32B) in some uncertainty-driven tasks, while DeepSeek V3.1's superior reliability derives primarily from post-training reinforcement learning rather than architecture or size. Across models, we identify four recurring failure archetypes: premature action without grounding, over-helpfulness that substitutes missing entities, vulnerability to distractor-induced context pollution, and fragile execution under load. These patterns highlight the need for agentic evaluation methods that emphasize interactive grounding, recovery behavior, and environment-aware adaptation, suggesting that reliable enterprise deployment requires not just stronger models but deliberate training and design choices that reinforce verification, constraint discovery, and adherence to source-of-truth data.
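To make the per-trial behavioral analysis concrete, here is a minimal sketch, in Python, of how execution traces might be tagged with the four failure archetypes. The trace schema and the heuristics are illustrative assumptions, not the paper's actual pipeline.

```python
from enum import Enum

class FailureArchetype(Enum):
    """The four recurring failure modes identified in the analysis."""
    PREMATURE_ACTION = "acted before grounding in the environment"
    OVER_HELPFULNESS = "substituted a missing entity instead of reporting it"
    CONTEXT_POLLUTION = "followed distractor content in the context"
    FRAGILE_EXECUTION = "degraded under multi-step load"

def tag_trace(trace: dict) -> list[FailureArchetype]:
    """Toy heuristics over a hypothetical trace schema (all keys are assumptions)."""
    tags = []
    if trace.get("acted_without_reading"):
        tags.append(FailureArchetype.PREMATURE_ACTION)
    if trace.get("invented_entity"):
        tags.append(FailureArchetype.OVER_HELPFULNESS)
    if trace.get("used_distractor_value"):
        tags.append(FailureArchetype.CONTEXT_POLLUTION)
    if trace.get("steps_completed", 0) < trace.get("steps_required", 0):
        tags.append(FailureArchetype.FRAGILE_EXECUTION)
    return tags
```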
AISysRev -- LLM-based Tool for Title-abstract Screening
Huotala, Aleksi, Kuutila, Miikka, Turtio, Olli-Pekka, Mäntylä, Mika
Systematic reviews are a standard practice for summarizing the state of evidence in software engineering. Conducting them is laborious, especially during the screening or study selection phase, where the number of papers can be overwhelming. During this phase, papers are assessed against inclusion and exclusion criteria based on their titles and abstracts. Recent research has demonstrated that large language models (LLMs) can perform title-abstract screening at a level comparable to that of a master's student. While LLMs cannot be fully trusted, they can help, for example, in Rapid Reviews, which aim to expedite the review process. Building on this research, we developed AISysRev, an LLM-based screening tool implemented as a web application running in a Docker container. The tool accepts a CSV file containing paper titles and abstracts, and users specify inclusion and exclusion criteria. Multiple LLMs can be used for screening via OpenRouter. AISysRev supports both zero-shot and few-shot screening, and also allows manual screening through interfaces that display LLM results as guidance for human reviewers. We conducted a trial study with 137 papers using the tool. Our findings indicate that papers can be classified into four categories: Easy Includes, Easy Excludes, Boundary Includes, and Boundary Excludes. The Boundary cases, where LLMs are prone to errors, highlight the need for human intervention. While LLMs do not replace human judgment in systematic reviews, they can significantly reduce the burden of assessing large volumes of scientific literature. Video: https://www.youtube.com/watch?v=jVbEj4Y4tQI Tool: https://github.com/EvoTestOps/AISysRev
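As a rough illustration of the screening step, here is a minimal sketch of the kind of OpenRouter call such a tool might make (OpenRouter exposes an OpenAI-compatible API at https://openrouter.ai/api/v1). The prompt wording, decision format, and model choice are illustrative assumptions, not AISysRev's actual implementation.

```python
# Minimal zero-shot title-abstract screening call via OpenRouter.
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_KEY")

def screen(title: str, abstract: str, criteria: str, model: str) -> str:
    prompt = (
        f"Inclusion/exclusion criteria:\n{criteria}\n\n"
        f"Title: {title}\nAbstract: {abstract}\n\n"
        "Answer with exactly one word: INCLUDE or EXCLUDE."
    )
    resp = client.chat.completions.create(
        model=model,  # any OpenRouter model id, e.g. "openai/gpt-4o-mini"
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()
```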
Reconstructing Trust Embeddings from Siamese Trust Scores: A Direct-Sum Approach with Fixed-Point Semantics
Alpay, Faruk, Alpay, Taylan, Kilictas, Bugra
We study the inverse problem of reconstructing high-dimensional trust embeddings from the one-dimensional Siamese trust scores that many distributed-security frameworks expose. Starting from two independent agents that publish time-stamped similarity scores for the same set of devices, we formalise the estimation task, derive an explicit direct-sum estimator that concatenates paired score series with four moment features, and prove that the resulting reconstruction map admits a unique fixed point under a contraction argument rooted in Banach theory. A suite of synthetic benchmarks (20 devices x 10 time steps) confirms that, even in the presence of Gaussian noise, the recovered embeddings preserve inter-device geometry as measured by Euclidean and cosine metrics; we complement these experiments with non-asymptotic error bounds that link reconstruction accuracy to score-sequence length. Beyond methodology, the paper demonstrates a practical privacy risk: publishing granular trust scores can leak latent behavioural information about both devices and evaluation models. We therefore discuss counter-measures -- score quantisation, calibrated noise, obfuscated embedding spaces -- and situate them within wider debates on transparency versus confidentiality in networked AI systems. All datasets, reproduction scripts and extended proofs accompany the submission so that results can be verified without proprietary code.
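A minimal sketch of the direct-sum estimator as described: the two agents' paired score series for a device are concatenated with four moment features. Assuming the four moments are the per-agent mean and standard deviation; the paper's exact choice may differ.

```python
import numpy as np

def direct_sum_estimate(s1: np.ndarray, s2: np.ndarray) -> np.ndarray:
    """s1, s2: time series of Siamese trust scores for one device (length T each)."""
    moments = np.array([s1.mean(), s1.std(), s2.mean(), s2.std()])
    return np.concatenate([s1, s2, moments])  # reconstructed embedding, shape (2T + 4,)

# Example at the paper's synthetic scale: 20 devices x 10 time steps, Gaussian noise.
rng = np.random.default_rng(0)
scores_a = rng.normal(0.8, 0.05, size=(20, 10))  # agent 1's published scores
scores_b = rng.normal(0.8, 0.05, size=(20, 10))  # agent 2's published scores
embeddings = np.stack([direct_sum_estimate(a, b)
                       for a, b in zip(scores_a, scores_b)])  # (20, 24)
```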
Consumer-grade EEG-based Eye Tracking
Afonso, Tiago Vasconcelos, Heinrichs, Florian
EEG-based eye tracking (ET) is emerging as a promising application of brain-computer interfaces (BCIs) (Dietrich et al., 2017; Fuhl et al., 2023; Kastrati et al., 2021; Sun et al., 2023). While EEG is typically used to record the electrical activity of the brain, it also captures eye movement artifacts due to the inherent electrical charge of the eyes. Although these signals are usually considered noise in other BCI applications and are often removed (Croft and Barry, 2000), they can be used effectively to track eye movements. These signals are also easier to decode than brain activity, as they are not subject to the complexity and noise associated with brain signal interpretation. In addition, achieving reliable and accurate eye tracking with EEG could significantly enhance existing consumer BCIs, opening up a wide range of new applications. Beyond BCI applications, EEG-based eye tracking is also an interesting alternative in its own right, offering several advantages over camera-based methods, which dominate eye tracking today.
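A minimal sketch of the basic signal path this implies: band-pass the channels where ocular artifacts dominate, then fit a linear map to gaze coordinates. The channel count, the 0.1-20 Hz band, and the ridge regressor are illustrative assumptions, not the paper's exact method.

```python
import numpy as np
from scipy.signal import butter, filtfilt
from sklearn.linear_model import Ridge

def bandpass(x, fs, lo=0.1, hi=20.0, order=4):
    """Keep the low-frequency band where eye-movement artifacts live."""
    b, a = butter(order, [lo / (fs / 2), hi / (fs / 2)], btype="band")
    return filtfilt(b, a, x, axis=-1)

fs = 250                                  # sampling rate in Hz
eeg = np.random.randn(8, fs * 60)         # 8 frontal channels, 60 s (stand-in data)
gaze = np.random.randn(fs * 60, 2)        # (x, y) targets from a calibration grid

X = bandpass(eeg, fs).T                   # samples x channels
model = Ridge(alpha=1.0).fit(X, gaze)     # linear map from EEG to gaze
pred = model.predict(X)                   # predicted screen coordinates
```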
DataSciBench: An LLM Agent Benchmark for Data Science
Zhang, Dan, Zhoubian, Sining, Cai, Min, Li, Fengzu, Yang, Lekang, Wang, Wei, Dong, Tianjiao, Hu, Ziniu, Tang, Jie, Yue, Yisong
This paper presents DataSciBench, a comprehensive benchmark for evaluating Large Language Model (LLM) capabilities in data science. Recent related benchmarks have primarily focused on single tasks, easily obtainable ground truth, and straightforward evaluation metrics, which limits the scope of tasks that can be evaluated. In contrast, DataSciBench is built on a more comprehensive, curated collection of natural and challenging prompts with uncertain ground truth and evaluation metrics. We develop a semi-automated pipeline for generating ground truth (GT) and validating evaluation metrics. This pipeline uses an LLM-based self-consistency and human verification strategy to produce accurate GT from collected prompts, predefined task types, and aggregate functions (metrics). Furthermore, we propose a Task-Function-Code (TFC) framework to assess each code execution outcome based on precisely defined metrics and programmatic rules. Our experimental framework involves testing 6 API-based models, 8 open-source general models, and 9 open-source code generation models on the diverse set of prompts we gathered. This approach aims to provide a more comprehensive and rigorous evaluation of LLMs in data science, revealing their strengths and weaknesses. Experimental results demonstrate that API-based models outperform open-source models on all metrics, and DeepSeek-Coder-33B-Instruct achieves the highest score among open-source models. We release all code and data at https://github.com/THUDM/DataSciBench.
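A minimal sketch of the Task-Function-Code idea: each item pairs an aggregate (metric) function with the outcome of executing model-generated code. The field names and the example metric are illustrative assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class TFCItem:
    task: str                              # task type, e.g. "csv aggregation"
    function: Callable[[Any, Any], float]  # aggregate/metric function
    code: str                              # model-generated code to execute

def exact_match(output: Any, ground_truth: Any) -> float:
    return float(output == ground_truth)

item = TFCItem(
    task="csv aggregation",
    function=exact_match,
    code="result = sum(row['sales'] for row in rows)",
)

rows = [{"sales": 3}, {"sales": 4}]
scope: dict[str, Any] = {"rows": rows}
exec(item.code, scope)                     # run the generated code
score = item.function(scope["result"], 7)  # apply the metric to the outcome
```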
AutoFLUKA: A Large Language Model Based Framework for Automating Monte Carlo Simulations in FLUKA
Ndum, Zavier Ndum, Tao, Jian, Ford, John, Liu, Yang
Monte Carlo (MC) simulations, particularly using FLUKA, are essential for replicating real-world scenarios across scientific and engineering fields. Despite its robustness and versatility, FLUKA faces significant limitations in automation and integration with external post-processing tools, leading to workflows with a steep learning curve that are time-consuming and prone to human error. Traditional methods involving shell and Python scripts, MATLAB, and Microsoft Excel require extensive manual intervention, lack flexibility, and add complexity to evolving scenarios. This study explores the potential of Large Language Models (LLMs) and AI agents to address these limitations. AI agents integrate natural language processing with autonomous reasoning for decision-making and adaptive planning, making them well suited for automation. We introduce AutoFLUKA, an AI agent application developed using the LangChain Python framework to automate typical MC simulation workflows in FLUKA. AutoFLUKA can modify FLUKA input files, execute simulations, and efficiently process results for visualization, significantly reducing human labor and error. Our case studies demonstrate that AutoFLUKA can handle both generalized and domain-specific cases, such as microdosimetry, with a streamlined, automated workflow, showcasing its scalability and flexibility. The study also highlights the potential of Retrieval-Augmented Generation (RAG) tools to act as virtual assistants for FLUKA, further improving user experience and efficiency. In conclusion, AutoFLUKA represents a significant advancement in automating MC simulation workflows, offering a robust solution to these inherent limitations. This innovation not only saves time and resources but also opens new paradigms for research and development in high energy physics, medical physics, nuclear engineering, and space and environmental science.
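A minimal sketch of the kind of tool functions such an agent wraps: patch an input card, launch the run, collect output. The `rfluka` launcher invocation, file names, and card format here are assumptions about a typical FLUKA setup, not AutoFLUKA's actual code.

```python
import subprocess
from pathlib import Path

def set_card_value(inp: Path, card: str, new_line: str) -> None:
    """Replace the first input line starting with `card` (e.g. 'BEAM')."""
    lines = inp.read_text().splitlines()
    lines = [new_line if ln.startswith(card) else ln for ln in lines]
    inp.write_text("\n".join(lines) + "\n")

def run_simulation(inp: Path, cycles: int = 1) -> None:
    # rfluka is FLUKA's standard run script; flags may differ per installation.
    subprocess.run(["rfluka", "-N0", f"-M{cycles}", inp.stem],
                   cwd=inp.parent, check=True)

def collect_output(workdir: Path) -> list[Path]:
    return sorted(workdir.glob("*.out"))   # hand these to post-processing
```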
DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models
Huang, Yiming, Luo, Jianwen, Yu, Yan, Zhang, Yitong, Lei, Fangyu, Wei, Yifan, He, Shizhu, Huang, Lifu, Liu, Xiao, Zhao, Jun, Liu, Kang
We introduce DA-Code, a code generation benchmark specifically designed to assess LLMs on agent-based data science tasks. This benchmark features three core elements. First, the tasks within DA-Code are inherently challenging, setting them apart from traditional code generation tasks and demanding advanced skills in grounding and planning. Second, examples in DA-Code are all based on real and diverse data, covering a wide range of complex data wrangling and analytics tasks. Third, to solve the tasks, models must use complex data science programming languages to perform intricate data processing and derive the answers. We set up the benchmark in a controllable, executable environment that aligns with real-world data analysis scenarios and is scalable. The annotators meticulously designed the evaluation suite to ensure the accuracy and robustness of the evaluation. We also develop the DA-Agent baseline. Experiments show that although the baseline performs better than other existing frameworks, even the current best LLMs achieve only 30.5% accuracy, leaving ample room for improvement. We release our benchmark at https://da-code-bench.github.io.
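A minimal sketch of the observe-act loop such an executable benchmark implies: the agent proposes code, the sandbox runs it, and the final answer is scored. All class and method names are illustrative assumptions, not DA-Code's actual harness.

```python
class Sandbox:
    """Toy executable environment with persistent state across agent steps."""
    def __init__(self, task_files: dict[str, str]):
        self.state: dict[str, object] = {}
        self.files = task_files            # task data exposed to agent code

    def run(self, code: str) -> str:
        try:
            exec(code, {"files": self.files}, self.state)
            return str(self.state.get("answer", ""))
        except Exception as e:             # feed errors back as observations
            return f"ERROR: {e}"

def evaluate(agent, task, max_steps: int = 10) -> bool:
    env = Sandbox(task.files)
    obs = task.instruction
    for _ in range(max_steps):
        code, done = agent.act(obs)        # agent emits code or a final answer
        obs = env.run(code)
        if done:
            break
    return obs == task.expected            # exact-match scoring for the sketch
```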
Text2Chart31: Instruction Tuning for Chart Generation with Automatic Feedback
Zadeh, Fatemeh Pesaran, Kim, Juyeon, Kim, Jin-Hwa, Kim, Gunhee
Large language models (LLMs) have demonstrated strong capabilities across various language tasks, notably through instruction tuning. However, LLMs still struggle to visualize complex, real-world data through charts and plots. First, existing datasets rarely cover the full range of chart types, such as 3D, volumetric, and gridded charts. Second, supervised fine-tuning does not fully leverage the intricate relationships within rich datasets spanning text, code, and figures. To address these challenges, we propose a hierarchical pipeline and a new dataset for chart generation. Our dataset, Text2Chart31, covers 31 unique plot types drawn from the Matplotlib library, with 11.1K tuples of descriptions, code, data tables, and plots. Moreover, we introduce a reinforcement learning-based instruction tuning technique for chart generation that requires no human feedback. Our experiments show that this approach significantly enhances model performance, enabling smaller models to outperform larger open-source models and to be comparable to state-of-the-art proprietary models in data visualization tasks. We make the code and dataset available at https://github.com/fatemehpesaran310/Text2Chart31.
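To illustrate the kind of description-to-code pair the dataset collects, here is an invented example of one such tuple; this exact description and plot are illustrative, not taken from Text2Chart31.

```python
import matplotlib.pyplot as plt
import numpy as np

description = ("A 3D surface plot of z = sin(sqrt(x^2 + y^2)) "
               "over x, y in [-5, 5], with a viridis colormap.")

# The paired code that realizes the description.
x = np.linspace(-5, 5, 100)
y = np.linspace(-5, 5, 100)
X, Y = np.meshgrid(x, y)
Z = np.sin(np.sqrt(X**2 + Y**2))

fig = plt.figure()
ax = fig.add_subplot(projection="3d")     # one of the 31 Matplotlib plot types
ax.plot_surface(X, Y, Z, cmap="viridis")
plt.savefig("surface.png")
```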
BabelBench: An Omni Benchmark for Code-Driven Analysis of Multimodal and Multistructured Data
Wang, Xuwu, Cui, Qiwen, Tao, Yunzhe, Wang, Yiran, Chai, Ziwei, Han, Xiaotian, Liu, Boyi, Yuan, Jianbo, Su, Jing, Wang, Guoyin, Liu, Tingkai, Chen, Liyu, Liu, Tianyi, Sun, Tao, Zhang, Yufeng, Zheng, Sirui, You, Quanzeng, Yang, Yang, Yang, Hongxia
Large language models (LLMs) have become increasingly pivotal across various domains, especially in handling complex data types. This includes structured data processing, as exemplified by ChartQA and ChatGPT-Ada, and multimodal unstructured data processing, as seen in Visual Question Answering (VQA). These areas have attracted significant attention from both industry and academia. Despite this, there remains a lack of unified evaluation methodologies for these diverse data handling scenarios. In response, we introduce BabelBench, an innovative benchmark framework that evaluates the proficiency of LLMs in managing multimodal, multistructured data with code execution. BabelBench comprises 247 meticulously curated problems that challenge models with tasks in perception, commonsense reasoning, logical reasoning, and more. Beyond the basic capabilities of multimodal understanding, structured data processing, and code generation, these tasks demand advanced capabilities in exploration, planning, reasoning, and debugging. Our experimental findings on BabelBench indicate that even cutting-edge models like ChatGPT 4 exhibit substantial room for improvement. The insights derived from our comprehensive analysis offer valuable guidance for future research within the community. The benchmark data can be found at https://github.com/FFD8FFE/babelbench.
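A minimal sketch of what a single multimodal, multistructured benchmark item might bundle: a question, an image, a structured table, and a ground-truth answer for scoring. The field names are illustrative assumptions, not BabelBench's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class BenchItem:
    question: str                          # natural-language task
    image_path: str                        # unstructured (visual) input
    table: list[dict] = field(default_factory=list)  # structured input
    answer: str = ""                       # ground truth for scoring

item = BenchItem(
    question="Which product in the table matches the item shown in the image?",
    image_path="items/0042.png",
    table=[{"product": "mug", "price": 7}, {"product": "lamp", "price": 25}],
    answer="mug",
)
```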
DOO-RE: A dataset of ambient sensors in a meeting room for activity recognition
Kim, Hyunju, Kim, Geon, Lee, Taehoon, Kim, Kisoo, Lee, Dongman
With the advancement of IoT technology, recognizing user activities with machine learning methods is a promising way to provide various smart services to users. High-quality data with privacy protection is essential for deploying such services in the real world, and data streams from surrounding ambient sensors are well suited to this requirement. Existing ambient sensor datasets cover only constrained private spaces; datasets for public spaces have yet to be explored despite growing research interest in them. To meet this need, we built a dataset collected from a meeting room equipped with ambient sensors. The dataset, DOO-RE, includes data streams from various ambient sensor types such as Sound and Projector. Each sensor data stream is segmented into activity units, and multiple annotators provide activity labels through a cross-validation annotation process to improve annotation quality. In total, we obtain 9 types of activities. To the best of our knowledge, DOO-RE is the first dataset to support the recognition of both single and group activities in a real meeting room with reliable annotations.
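A minimal sketch of an activity-unit record as the description suggests: a segmented window of ambient-sensor readings plus labels from multiple annotators. Field and sensor value formats are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass

@dataclass
class ActivityUnit:
    start: float                 # segment start time (seconds)
    end: float                   # segment end time (seconds)
    readings: dict[str, list]    # sensor type -> time-ordered values
    labels: list[str]            # one label per annotator

    def majority_label(self) -> str:
        """Resolve the cross-validated annotations to a single label."""
        return max(set(self.labels), key=self.labels.count)

unit = ActivityUnit(
    start=120.0, end=300.0,
    readings={"Sound": [0.2, 0.7, 0.6], "Projector": [1, 1, 0]},
    labels=["presentation", "presentation", "discussion"],
)
print(unit.majority_label())     # -> "presentation"
```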